Using structural variation to analyze genomic differences for the prediction of heterosis

ABSTRACT

A novel method for prediction of the degree of heterotic phenotypes in plants is disclosed. Structural variation analyses of the genome are used to predict the degree of a heterotic phenotype in plants. In some examples, copy number variation is used to predict the degree of heterotic phenotype. In some methods, copy number variation is detected using comparative genomic hybridization arrays. Further, methods for optimizing the arrays are disclosed, together with kits for producing such arrays, as well as hybrid plants selected for development based on the predicted results.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Patent Application Ser. No. 61/017,227 filed Dec. 28, 2007, which is herein incorporated by reference in its entirety.

FIELD OF THE INVENTION

This invention relates to the field of plant molecular biology and plant breeding, particularly the prediction of the degree of heterotic phenotypes in plants.

BACKGROUND

Agricultural output has risen dramatically during the last half of the twentieth century. A large portion of this increase has been attributed to the development and use of hybrid seed varieties in core crops such as corn, sorghum, sunflower, alfalfa, canola and wheat. The success of hybrid seed varieties is due to a phenomenon called heterosis, where hybrid plants display a more desirable phenotype than either of the two inbred parental lines used to produce the hybrid plant. Heterosis has been observed in a number of plant traits including yield, plant height, biomass, resistance to disease and insects, tolerance to stress and others. These heterotic traits are polygenic in nature, resulting in their characteristic range of phenotypes, rather than traditional discrete Mendelian phenotypes. The polygenic nature of the traits results in complex patterns of inheritance such that the underlying components for the observed heterotic phenotypes are still a matter of debate in the plant science community.

Because of the economic value of heterosis, there have been several attempts to use molecular biology techniques to augment traditional hybrid plant breeding programs. The bulk of the efforts have focused on either mRNA (messenger RNA) or genomic DNA. The mRNA approach is extremely difficult as comparisons require tissue samples selected from the same portion of the plant, at the same developmental time, and in the same or highly similar environmental conditions. The process is further complicated as a researcher needs to determine which plant portion or developmental stage will yield the best results for predicting the degree of a particular heterotic phenotype of interest. As a result of these complications, mRNA-based predictions frequently have high levels of noise and have low accuracy in the prediction of the degree of a heterotic phenotype.

The use of genomic DNA to predict the degree of one or more heterotic phenotypes has been similarly disappointing. Initial efforts used subtractive hybridization or fluorescent in situ hybridization in order to identify copy number differences in inbred plant lines. These techniques do not produce easily quantifiable results and can only detect gross differences in copy numbers, such as a doubling or complete elimination. This is a significant problem in polyploid plants as chromosomal duplications and other evolutionary events have resulted in genes with multiple copies, some of which are pseudogenes, throughout the plant genome. These higher copy numbers greatly reduce the usefulness of the genomic DNA approaches as they are unable to accurately detect the addition or deletion of a single copy of a gene represented three or more times in the genome.

Another genomic approach has been the use of genetic markers to predict heterosis. In these techniques, RFLP markers as well as other traditional markers have been used. Researchers have attempted to use genetic markers to predict the degree of a heterotic phenotype with some success, so long as the potential parent plants belong to the same heterotic groups that were used in the initial crosses to generate the correlational data upon which the prediction is based. Once plants from other heterotic groups are used, the heterotic phenotype predictive ability of genetic markers greatly diminishes. The reason for the loss of predictive ability has been attributed to insufficient linkage of the markers to quantitative trait loci controlling the trait of interest, and a lack of gametic phase linkage disequilibrium between the marker and quantitative trait loci alleles. This diminished predictive ability severely limits the use of genetic markers in plant breeding programs.

Based on these efforts, the application of molecular biology techniques to the prediction of the degree of a heterotic phenotype has been problematic at best. Despite years of research, there has yet to be a satisfactory method developed.

Comparative Genome Hybridization (CGH) is a technique that has been employed to study chromosomal abnormalities in animal cells. A major area of CGH use has been in analyzing cancer mutations in an effort to better identify cancer cells in order to select more effective courses of therapy. CGH is particularly effective in animal cells as there are typically two copies of any given gene in the genome (one from each parent). Additionally, entire genomes for mammals are currently known. Researchers have been able to take advantage of the low duplication and genome sequence information to identify duplicated and deleted chromosomal regions. This information can then be used to identify the changes that have transformed normal cells into cancerous cells. However, the complete genome sequence of several major crops is not known at present. As a result, there has been little use of CGH in plants and doing so requires overcoming the numerous differences that arise when working with plant genomics.

SUMMARY

The present invention relates to the use of structural variation analyses of the genome, such as copy number variation analysis, detected for example by using comparative genomic hybridization, to predict the degree of a heterotic phenotype in progeny plants. In one aspect of the invention, groups of oligonucleotide probe molecules are contacted with plant genomic DNA and the resultant mixture of hybridized probes and genomic DNA is analyzed so as to determine probes that show differing hybridization levels between two different parents. The results are then used to predict the degree of a heterotic phenotype of progeny plants derived from the two parental lines. In another aspect of the invention, the predicted degree of a heterotic phenotype is used in the development of hybrid plants. In yet another aspect of the invention, a subset of oligonucleotide probe molecules that are good predictors of the degree of a heterotic phenotype are selected from a larger population of oligonucleotide probe molecules and the selected subset is then used in future assays to predict the degree of a heterotic phenotype. Another aspect of the invention is a kit comprising the selected oligonucleotide probe molecule subset that can be used for the prediction of the degree of a heterotic phenotype in plant lines. Other features will be discussed in greater detail in the description below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows yield predictions based on a PLS regression model built using the intensity ratios selected from the genetic algorithm and three latent variables. This PLS regression model was used to predict yield for three additional inbreds: PHBE2, PHHB4 and PHB37, hybridized on two 44,000 oligonucleotide probe arrays.

FIG. 2 shows yield predictions based on a PLS regression model built using the genetic algorithm selected intensity ratios for all nine of the inbreds: PHN46, PHR03, PHB73, PHW52, PHK29, PHW61, PHBE2, PHHB4 and PHB37, and ratios of six of the inbreds compared to a replicate measure of PHP38, PHN46, PHR03, PHB73, PHW52, PHK29 and PHW61. The number of latent variables was increased to five and autoscaling was performed to account for this noise. Mean centering was performed on the yield data.

FIG. 3 shows data exemplifying genetic diversity within a maize genotype. Representative data for two oligos showing copy number variation between plants are shown.

FIG. 4 shows data exemplifying genetic diversity within a maize heterotic group. Representative data showing copy number variation between two stiff stalk maize inbreds are shown.

FIG. 5 shows data exemplifying genetic diversity between two maize heterotic groups. Representative data showing copy number variation between a stiff stalk maize inbred and a non-stiff stalk maize inbred are shown.

FIG. 6 shows yield prediction data from copy number variations detected by comparative genomic hybridization.

FIG. 7 shows ear height prediction data from copy number variations detected by comparative genomic hybridization.

FIG. 8 shows moisture prediction data from copy number variations detected by comparative genomic hybridization.

FIG. 9 shows plant height prediction data from copy number variations detected by comparative genomic hybridization.

DETAILED DESCRIPTION

The following terms will be used frequently in the description that follows. The following definitions are provided to facilitate understanding of the disclosure.

“Coding regions” means the regions of an organism's genome that code for proteins or RNA molecules, wherein the coding regions and/or the RNA may include introns, exons, regulatory sequences, and 5′ and 3′ untranslated regions.

“Copy number variation” (CNV) is a segment of DNA for which copy-number differences have been found by comparison of two or more genomes, or comparison to a reference sequence. The term CNV encompasses other terminology to describe variants including large-scale copy number variants (LCV), copy number polymorphisms (CNP) and intermediate-sized variants (ISV).

“F1 hybrid plant variety” means the first filial generation resulting from crossing two distinct parental lines.

“Heterosis-related phenotype” means an observable trait in a plant where the phenotype exhibited in hybrid plants is more desirable when compared to the corresponding phenotype exhibited in homozygous parent plants.

“Hybridization intensity” means a measure of the quantity of genomic DNA hybridized to an oligonucleotide probe molecule based on a quantifiable marker linked to the prepared genomic DNA. The quantity of prepared DNA binding to the oligonucleotide probe molecule reflects the sequence similarity between the genomic DNA and the oligonucleotide probe molecule as well as the copy number of the region of the genomic DNA bound to the oligonucleotide probe molecule.

“Hybridization pattern” means a collection of the hybridization intensities for each unique oligonucleotide probe molecule in a plurality of oligonucleotide probe molecules after the probe molecules have been placed in contact with a sample of DNA or RNA.

“Oligonucleotide array” means a plurality of oligonucleotide probe molecules stably associated with a solid support.

“Oligonucleotide probe molecules” means short sequences of DNA and/or RNA that will selectively hybridize with a prepared sample containing DNA and/or RNA.

“p-value” means a measure of probability that an observed difference between hybridization intensities happened by chance. For example, a p-value of 0.01 (p=0.01) means there is a 1 in 100 chance the result occurred by chance. The lower the p-value, the more likely it is that the difference observed between hybridization intensities was caused by actual differences between the two samples.

“Prepared genomic DNA” means DNA from an organism that has been digested and/or sheared and labeled with a detectable marker. Further manipulation of the DNA may be made, including PCR amplification of the DNA before the DNA is digested and/or sheared, between the digesting/shearing step and the labeling step, or after the labeling step. Techniques may also be applied to select for a subset of genomic DNA, such as, for example, methyl sensitive restriction enzyme screening, use of melting curves and selection based on speed of refolding, use of Cot DNA, and the like. Such subsets of genomic DNA are included within this definition.

“Structural variation” refers to the changes in genetic structure that occur in the genome. A wide range of structural variation can occur in the genome including deletions, insertions, duplications, and inversions. These variations range in size, and are typically grouped as 1-500 bp (fine-scale), 500 bp-100 kb (intermediate-scale), and >100 kb (large-scale). As used herein, structural variation does not include RFLPs.

Any method can be used to detect, quantify, and/or analyze copy number variation between two or more genomes. For example, copy number variation can be discovered by cytogenetic techniques such as fluorescent in situ hybridization, comparative genomic hybridization, array comparative genomic hybridization, large-scale SNP genotyping, whole genome sequencing, paired-end mapping, clone-end resequencing, in silico analyses, or combinations thereof. Optionally, computer or statistical analyses and/or modeling may be used in conjunction with any CNV methods.

Copy number variation detection is distinct from typical single nucleotide polymorphism detection. Hybridization with short oligonucleotides on solid surfaces may be used to detect single nucleotide polymorphisms (SNP) (Chee, et al., (1996) Science 274:610-614). In this SNP detection application, 20-22 mer oligonucleotides are usually used to maximize the ability to detect single mismatches between the probe and target (Lipshutz, et al., (1995) Biotechniques 19:442-447). Longer oligonucleotides, such as the 60-mers used in Example 1 for CGH, hybridize with very similar affinity to perfectly matched targets and to targets with one or even two mismatches. Therefore such oligonucleotide probes are not suitable for SNP detection. These longer probes are typically very sensitive to the presence or absence of the target sequence, or to large changes in the quantity of the target sequence, and are therefore useful for detecting copy number variation. In maize, SNP polymorphisms occur in coding regions with an overall frequency of less than 1 SNP/100 bp (Ching, et al., (2002) BMC Genet 3:19). Most of the probes used in Examples 1-2 contain 0-1 mismatch as compared to the genomic DNA, and hybridize well to the target. In the rare occasions of deletion or multiplication of the target in the genome, such probes will be expected to show numerically large ratios of hybridization signal between different inbreds, and to deviate from the 1:1 ratios expected for targets that are identical or contain 1 mismatch. In Examples 1-2, probes with large observed hybridization ratios between different maize inbreds were selected, although no specific representation is made as to molecular differences underlying such hybridization ratios except that they are unlikely to be due to the presence of 1-2 bp differences between probe and target.

In one example, the method described herein utilizes CGH to predict the degree of one or more heterotic phenotypes in hybrid plant varieties. The disclosed method allows for selection of inbred parental lines, while avoiding the need to perform resource-consuming test crosses across a large number of potential parental lines. This method may be used with a number of oligonucleotide probe molecules ranging from a large to an unexpectedly low number of oligonucleotide probe molecules for prediction of the degree of heterotic phenotypes. The selection of oligonucleotide probe molecules can be facilitated by the use of an optimization procedure, an example of which is described herein. Additionally, the disclosed CGH method provides an unexpectedly significant increase in predictive ability over techniques currently used in plant breeding. The use of CGH also eliminates many of the difficulties experienced in the use of mRNA for the prediction of the degree of one or more heterotic phenotypes in plants, as the genomic DNA is the same in every somatic cell in the plant (apart from gametophytes) regardless of the developmental stage, environmental conditions, or the tissue sampled. These results indicate that CGH is a reliable assay for the prediction of the degree of one or more heterotic phenotypes in plants.

A review of CGH, including the general considerations and a description of the technology, may be found in Pinkel and Albertson, (2005) Nature Genetics 37:S11-S17, and is incorporated by reference herein in its entirety. The familiarity with CGH technology of those of ordinary skill in the art is therefore assumed in the foregoing description. Using the method claimed to predict the degree of one or more heterotic phenotypes includes the selection of a plurality of oligonucleotide probe molecules, obtaining sample genomic DNA, preparing the genomic DNA, hybridization of the sample DNA with the oligonucleotide probe molecules, detection of the resultant hybridization intensities, comparison of the intensities detected with results from one or more other samples with known heterotic phenotypes and predicting the heterotic phenotype of progeny plants derived from the plants that provided the genomic DNA.

One way to improve the disclosed methods is the selection of the plurality of oligonucleotide probe molecules. In one example, the plurality of oligonucleotide probe molecules comprises an oligonucleotide array. In some examples, an oligonucleotide array designed for mRNA analysis can be used as the plurality of oligonucleotide probe molecules. Optionally, the oligonucleotide array comprises oligonucleotide probe molecules covering the entire plant genome, with redundant sampling of each region of the genome as well as positive and negative controls. In some examples, the oligonucleotide array comprises oligonucleotide probe molecules that are known to be predictive of the degree of a heterotic phenotype in the target plant.

When selecting oligonucleotide probe molecules for use, factors such as molecule size, molecule composition, and the genomic location of the molecules selected may be considered. Regarding molecule size, smaller molecules are less able to hybridize with sequences that contain mismatches, including insertions, deletions, or substitutions, but are less susceptible to the formation of secondary structures. Longer oligonucleotide probe molecules are more able to hybridize to DNA containing mismatches, but are more susceptible to the formation of secondary structures.

Oligonucleotide probe molecules that form secondary structures are less able to hybridize with the prepared sample genomic DNA. The prediction of secondary structures in oligonucleotide sequences is well known and there are several software packages that are able to predict secondary structure formations and thermodynamic properties such as mFOLD (Zuker, et al. (1999) Algorithms and Thermodynamics for RNA Secondary Structure Prediction: A Practical Guide in RNA Biochemistry and Biotechnology, Barciszewski and Clark, eds., NATO ASI Series, Kluwer Academic Publishers) and RNAfold (Vienna RNA Package; Hofacker, et al., (1994) Monatshefte f. Chemie 125:167-188; Zuker and Stiegler, (1981) Nucl Acids Res 9:133-148). Using these tools, it is possible to balance the coverage of genomic locations with the likelihood of secondary structure formation. When using a comprehensive oligonucleotide probe molecule set, the oligonucleotide probe molecules may be selected such that the entire plant genome is covered multiple times with probes that are not likely to form secondary structures. When using a smaller oligonucleotide probe molecule set, the probes may be selected to cover the genomic regions of interest with redundant coverage while still maintaining a low likelihood of forming secondary structures.

The oligonucleotide probe molecules used in the methods are generally between 20 and 100 nucleotides in length. In some examples, the oligonucleotide probe molecules are 60 nucleotides in length. Of course, the oligonucleotide probe molecules in a given plurality need not all be of uniform length, and in some examples having oligonucleotide probe molecules of differing lengths may utilize or compensate for the varying characteristics of oligonucleotide probe molecules of various lengths described above.

The quality of data produced by the method can be increased by incorporating more than one oligonucleotide probe molecule per gene or genomic region of interest. The inclusion of these redundant oligonucleotide probe molecules provides internal checks to determine if the differing hybridization intensities are the result of a difference in copy number of a gene or chromosomal region or random noise. In some examples, more than one oligonucleotide probe molecule per gene or DNA region of interest is included in the plurality of oligonucleotide probe molecules. In some examples, three oligonucleotide probe molecules are used for each gene or region of interest.
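
As a minimal sketch of such an internal check, the Python snippet below calls a gene as differing between parents only when a majority of its redundant probes exceed a fold cutoff in the same direction. The majority rule, the 1.5-fold default, and the name consistent_call are illustrative assumptions and are not taken from the disclosed method.

```python
import math


def consistent_call(probe_ratios, min_fold=1.5):
    """Summarize redundant probes for one gene or region.

    probe_ratios holds the parent A / parent B intensity ratio of each probe.
    Returns 'higher', 'lower', or 'no_call' (assumed majority-vote rule).
    """
    threshold = math.log2(min_fold)
    logs = [math.log2(r) for r in probe_ratios]
    up = sum(1 for v in logs if v >= threshold)
    down = sum(1 for v in logs if v <= -threshold)
    needed = len(logs) // 2 + 1  # strict majority of the probes
    if up >= needed:
        return "higher"
    if down >= needed:
        return "lower"
    return "no_call"


# Hypothetical ratios for three probes per gene.
print(consistent_call([2.1, 1.9, 2.4]))  # all three probes agree
print(consistent_call([2.2, 1.0, 0.9]))  # a single outlier probe
print(consistent_call([0.4, 0.5, 1.1]))  # two of three probes are lower
```

A single probe that deviates while its companions stay near 1:1 is treated here as noise, which is one way the redundancy described above could be put to use.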

The process of creating oligonucleotide arrays is well known and a number of commercial machines are available for creating oligonucleotide arrays, such as the BioOdyssey Calligrapher MiniArrayer by BioRad. Additionally, there are a number of commercial services that will create oligonucleotide arrays from a list of oligonucleotide probe molecule sequences, such as the SurePrint microarray printing service by Agilent. The plurality of oligonucleotide probe molecules typically includes at least about one hundred oligonucleotide probe molecules but can include any number of oligonucleotide probe molecules between about 100 and about 80,000 oligonucleotide probe molecules, or more if greater testing ranges are desired. Additionally, the plurality of oligonucleotide probe molecules can be designed to include any number of positive or negative controls to ensure validity of the data acquired by use of the plurality of oligonucleotide probe molecules.

Another aspect of the claimed method is the preparation of genomic DNA prior to contact with the plurality of oligonucleotide probe molecules. Preparation and labeling of genomic DNA is well known, and kits for the preparation of genomic DNA for CGH are available, such as the “Genomic DNA Labeling Kit PLUS” (Agilent). Genomic DNA is isolated from each parent line and individually labeled. Typically, approximately equal quantities of DNA from each parent are used; otherwise the accuracy of the results regarding differences in copy number may suffer, and the results may thus be less effective at predicting the degree of a heterosis-related phenotype of interest. The amount of isolated genomic DNA required depends on a number of factors, including the size of the oligonucleotide array and the protocols used. When a medium-sized oligonucleotide array (between about 40,000 and 100,000 oligonucleotide probe molecules) is used following standard protocols, the amount of genomic DNA used is typically between 0.2 and 3.0 μg. When the sample does not contain sufficient genomic DNA for direct hybridization, any well known amplification technique (e.g., PCR amplification) can be used to increase the quantity of prepared genomic DNA.

Typically, once a sufficient quantity of genomic DNA is available, the genomic DNA is fragmented using standard techniques such as digestion with at least one restriction endonuclease, mechanical shearing, or a combination thereof, to provide genomic DNA fragments of relatively uniform length. The fragmented genomic DNA sample may then be purified, quantified, and concentrated using standard techniques. The resultant concentrated genomic DNA fragments may be labeled in a PCR reaction using random primers and labeled dUTP molecules with each parent having a unique fluorescent label. If using different oligonucleotide arrays for each parent, it is then possible to use the same label with both parents, although typically both samples are analyzed on a single array. Optionally, it is also possible to use more than two labels for additional potential parents.

Generally, genomic DNA is extracted from tissue samples that are either fresh or frozen. Any tissue storage method can be used, the goal being to reduce the degradation of the genomic DNA. Additionally, signal strength can be improved by the elimination of low-complexity DNA using standard techniques such as methyl sensitive restriction enzyme screens of the genome, the use of melting curves with selection based on the speed of refolding, and the use of Cot DNA to precipitate low complexity sequences.

After DNA preparation, the prepared DNA is contacted with the plurality of oligonucleotide probe molecules. The prepared and labeled genomic DNA is typically contacted with an oligonucleotide array under strict hybridization conditions. Techniques and conditions required for hybridization of sample DNA to oligonucleotide arrays are known, and kits containing the requisite solutions and buffers are commercially available, such as the Oligo aCGH/ChIP-on-chip Hybridization Kit (Agilent, Santa Clara, Calif., USA). Prepared genomic DNA from the parents is typically hybridized to the same oligonucleotide array. Alternatively, the prepared and labeled genomic DNA of each parent may be hybridized to different arrays at different times so long as the different arrays contain at least some subset of common oligonucleotide probe molecules. In some examples, the DNA from each parent is hybridized to two separate but identical arrays of oligonucleotide probe molecules under the same hybridization conditions.

After contacting the prepared DNA with the plurality of oligonucleotide probe molecules, the hybridization intensities generated by the hybridization of the genomic DNA with the oligonucleotide probe molecules are detected. Optionally, a commercial microarray scanner (such as an Agilent DNA Microarray Scanner) is used to detect the hybridization intensities. The detected hybridization intensities are typically displayed on software associated with the scanner and can be optionally exported into any number of file formats for advanced processing. The data analysis software can generate statistics based on the detected hybridization intensities. This enables a researcher to determine the number of probes displaying differing hybridization intensities and the degree of the intensity differences. In some examples, the software is used to determine the number of differences, the fold difference, or both, of oligonucleotide probe molecules displaying a greater than 1.5-fold difference in hybridization intensity. Optionally, the software can be used to determine the number of oligonucleotide probe molecules displaying at least a 2-fold difference in hybridization intensity. In some examples, the software can be used to determine the number of oligonucleotide probe molecules displaying a greater than three-fold, but less than ten-fold difference in hybridization intensity. Of course, other values can be used for either the minimum fold difference and/or the maximum fold difference, if one wanted to either narrow or broaden the group of relevant hybridization intensities. For example, minimum fold differences may include any value between a 1.5-fold difference and a 10-fold difference, and the maximum fold difference may include any value between a 1.5-fold difference and a 50-fold difference. These minimum and maximum cutoffs can either be used independently (e.g., all oligonucleotide probe molecules displaying a difference in hybridization intensity greater than 1.7) or together (e.g., all oligonucleotide probe molecules displaying a greater than 2.1 but less than 11.4 fold difference) to provide data sets for further processing.
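
The minimum and maximum cutoffs described above can be sketched in Python as follows. This is an illustration only: the function names, probe identifiers, and intensity values are hypothetical, and the fold difference is assumed to be the larger intensity divided by the smaller one.

```python
def fold_difference(intensity_a, intensity_b):
    """Return the fold difference (always >= 1) between two positive intensities."""
    if intensity_a <= 0 or intensity_b <= 0:
        raise ValueError("intensities must be positive")
    ratio = intensity_a / intensity_b
    return ratio if ratio >= 1 else 1 / ratio


def select_probes(intensities, min_fold=None, max_fold=None):
    """Keep probes whose fold difference is > min_fold and < max_fold.

    intensities maps a probe id to (parent A intensity, parent B intensity).
    Either cutoff may be None, so the cutoffs can be applied independently.
    """
    selected = {}
    for probe_id, (a, b) in intensities.items():
        fold = fold_difference(a, b)
        if min_fold is not None and fold <= min_fold:
            continue
        if max_fold is not None and fold >= max_fold:
            continue
        selected[probe_id] = fold
    return selected


# Hypothetical probe ids and intensities, for illustration only.
example = {
    "probe_1": (1000.0, 980.0),   # nearly equal in both parents
    "probe_2": (3000.0, 1000.0),  # 3-fold higher in parent A
    "probe_3": (100.0, 1500.0),   # 15-fold lower in parent A
}
print(select_probes(example, min_fold=1.7))                 # minimum cutoff alone
print(select_probes(example, min_fold=2.1, max_fold=11.4))  # both cutoffs together
```

With the minimum cutoff alone, probe_2 and probe_3 are retained; adding the maximum cutoff removes probe_3.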

In another example, whole genome sequencing methods can be used to detect copy number variation. Whole genome shotgun sequencing of small (4000 to 7000 bp) genomes was in use in 1979 (Staden, (1979) Nucl Acids Res 6:2601-2610). The methodology has evolved to enable sequencing of larger, more complicated genomes, including the fruit fly genome and the human genome. In general, high molecular weight DNA is sheared into random fragments, size-selected (usually 2, 10, 50 and 150 kb), and cloned into an appropriate vector. The clones are sequenced from both ends, typically using a chain termination method to yield two short sequences. Each sequence is called an end-read or read and two reads from the same clone are referred to as mate pairs. The chain termination method typically produces reads of about 500-1000 bases, therefore mate pairs rarely overlap. The original sequence is reconstructed from all of the reads using sequence assembly software. Overlapping reads are collected into longer composite sequences known as contigs. Contigs can be linked together into scaffolds by following connections between mate pairs. The distance between contigs can be inferred from the mate pair positions if the average fragment length of the library is known and has a narrow window of deviation. Many sequencing technologies are available using gel methods, capillary methods, bead methods, or array methods. Rapidly advancing sequencing technologies include sequencing by synthesis, parallel bead arrays, electronic microchips, biochips, parallel microchips, sequencing by ligation, single DNA molecule sequencing, and nanopore-sequencing. In this example, deletions/insertions would be detected by aligning the sequences to a reference genome. CNVs would be detected by counting the number of times a tag/sequence was observed and then comparing the counts to another sample or reference genome.
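
The count-comparison step in the last sentence can be sketched in Python as shown below. The normalization by total count, the 1.5-fold gain and loss thresholds, the function name, and the region labels are all assumptions made for illustration; the text above does not specify them.

```python
def count_ratio_calls(sample_counts, reference_counts, gain=1.5, loss=1 / 1.5):
    """Compare tag counts per region between a sample and a reference.

    Counts are first converted to fractions of each library's total so that
    libraries of different sizes can be compared (an assumed normalization).
    """
    sample_total = sum(sample_counts.values())
    reference_total = sum(reference_counts.values())
    calls = {}
    for region, ref_count in reference_counts.items():
        if ref_count == 0:
            continue  # ratio is undefined without reference tags
        sample_frac = sample_counts.get(region, 0) / sample_total
        ref_frac = ref_count / reference_total
        ratio = sample_frac / ref_frac
        if ratio >= gain:
            calls[region] = "gain"
        elif ratio <= loss:
            calls[region] = "loss"
        else:
            calls[region] = "neutral"
    return calls


# Hypothetical tag counts for four regions.
sample = {"region_1": 100, "region_2": 200, "region_3": 100, "region_4": 0}
reference = {"region_1": 100, "region_2": 100, "region_3": 100, "region_4": 100}
print(count_ratio_calls(sample, reference))
```

In this made-up example, region_2 is called as a gain and region_4 as a loss relative to the reference.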

An in silico strategy was used to compare two human genomes at the DNA sequence level (Tuzun, et al., (2005) Nat Genet 37:727-732). The human genome sequence in NCBI was the reference genome. Approximately 67% of this reference sequence was from a single DNA library (the RPCI-11 BAC library) from a single individual. The second genome comprised pairs of end-sequence reads from >500,000 fosmid clones of the G248 DNA library. This DNA library was derived from an anonymous North American female of European ancestry. Since the sizes of fosmid clones are tightly regulated at about 40 kb, it was expected that pairs of end sequences for any given fosmid clone would align to the reference sequence with about a 40-kb spacing. Significant deviation of the alignment spacing (i.e., <32 kb or >48 kb) suggested the presence of a CNV at that locus. Using this criterion, 241 CNVs were identified, with most in the size range of 8 kb to 40 kb, and 80% of these were not previously identified. Also, most of these CNVs were below the expected resolution of the array platforms used in earlier CNV studies. One advantage over array-based methods is that the in silico approach also detects other structural genomic variants, for example inversions. These structural variants can be detected by consistent discrepancies in the aligned orientation of multiple paired end sequences.
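
The spacing criterion described above (<32 kb or >48 kb for clones of about 40 kb) can be sketched in Python as follows. The clone names, coordinates, and category labels are hypothetical, and the sketch classifies single clones only; it does not implement the requirement that several clones support the same locus.

```python
MIN_SPAN = 32_000  # bp, lower bound on concordant end-pair spacing
MAX_SPAN = 48_000  # bp, upper bound on concordant end-pair spacing


def classify_pair(left, right, orientation_ok=True):
    """Classify one clone from the reference positions of its two end reads.

    left and right are the outer alignment coordinates on the reference;
    orientation_ok is False when the reads do not face each other as expected.
    """
    if not orientation_ok:
        return "orientation_discordant"  # candidate inversion breakpoint
    span = right - left
    if span < MIN_SPAN:
        return "span_short"
    if span > MAX_SPAN:
        return "span_long"
    return "concordant"


# Hypothetical clones: (name, left coordinate, right coordinate, orientation_ok)
pairs = [
    ("clone_a", 100_000, 140_500, True),
    ("clone_b", 500_000, 525_000, True),
    ("clone_c", 900_000, 960_000, True),
    ("clone_d", 1_200_000, 1_240_000, False),
]
calls = {name: classify_pair(left, right, ok) for name, left, right, ok in pairs}
print(calls)
```

In practice, a locus would be reported only when multiple independent clones give the same discordant classification, consistent with the "consistent discrepancies" noted above.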

Chemometrics is the application of mathematical or statistical methods for experimental design and/or the analysis of data. Chemometrics can be used to identify further information from these data using various methods including statistics, pattern recognition, modeling, structure-property-relationship estimations, or combinations thereof. For example, the data can be hybridization data, hybridization intensity, p-values for intensity measures, hybridization intensity ratios, normalized data, sequencing data, sequence analysis output such as contigs, alignments, similarity scores, expected value scores, p-values, indels, or other data generated by a method to detect genomic structural variations.

In some examples, data analysis software is used to calculate p-values based on the measured differences in hybridization intensity. These values may be used as substitutes for or in addition to the fold differences in the intensity between oligonucleotide probe molecules. When using the p-value in lieu of fold difference, one can increase the stringency by decreasing the maximum p-value considered. For example, a researcher may wish to apply a low stringency cut-off by selecting all oligonucleotide probe molecules where the difference in hybridization intensity yielded a p-value less than 0.1. The stringency can be increased by lowering the maximum p-value to 0.05, 0.01, 0.001 or any value within the range of 0.01 to 0.001.
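
A minimal Python sketch of this stringency adjustment is given below. The probe identifiers and p-values are hypothetical, and the p-values are assumed to have been computed already by the data analysis software.

```python
def select_by_p_value(p_values, max_p):
    """Return the probe ids whose p-value is strictly less than max_p."""
    return sorted(probe for probe, p in p_values.items() if p < max_p)


# Hypothetical p-values for four probes.
p_values = {"probe_1": 0.2, "probe_2": 0.04, "probe_3": 0.008, "probe_4": 0.0005}

# Lowering the maximum p-value increases stringency and shrinks the selection.
for cutoff in (0.1, 0.05, 0.01, 0.001):
    print(cutoff, select_by_p_value(p_values, cutoff))
```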

Once the data is collected, the degree of a heterotic phenotype can be predicted based on the results obtained. This prediction is accomplished by comparing the number of probes meeting the user-defined threshold during analysis to the number of probes meeting the same criteria in other hybridizations involving parents where the heterotic phenotype in the resultant F1 hybrid progeny is known. Additionally, common statistical techniques, such as linear regression, may be used to perform the prediction.
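One way such a prediction could be sketched is an ordinary least-squares line relating the number of probes meeting the threshold to the known heterotic phenotype of reference crosses. The counts and phenotype values below are invented, and the helper names are hypothetical. This is an illustration of simple linear regression only, not a description of the analysis actually performed in the examples.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical reference crosses: probes meeting the threshold vs. known heterosis.
probe_counts = [100, 200, 300]
known_heterosis = [10.0, 20.0, 30.0]

slope, intercept = fit_line(probe_counts, known_heterosis)

# Predict for a new parental pair with 250 probes meeting the same threshold.
predicted = slope * 250 + intercept
print(round(predicted, 3))
```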

Optionally, the predicted degree of one or more heterotic phenotypes can be used to select parental lines for development of F1 hybrid plant lines as part of a plant breeding program. Modern plant breeding programs take a wide variety of factors into account when selecting plants for breeding. In another example, the predicted degree of a heterotic phenotype is included among the factors and forms at least part of the rationale for selecting two parental lines for breeding in a commercial or other plant breeding program.

The methods can be used to develop a plurality of oligonucleotide probe molecules specialized for the prediction of the degree of one or more heterotic phenotypes. The identification of oligonucleotide probe molecules that are predictive of heterotic phenotypes in a target plant can be accomplished through the use of an empirical approach. In one example, a number of F1 hybrid plant lines are created and grown under controlled conditions and the heterotic phenotype of interest is measured. Using an oligonucleotide array, typically one that covers a greater amount of the plant genome, CGH is performed for the parental lines. The resultant hybridization intensities are analyzed to determine the oligonucleotide probe molecules that demonstrate better ability to predict the degree of heterotic phenotype in the measured F1 hybrid plant lines. The oligonucleotide probe molecules that are better predictors are then used in an improved oligonucleotide array to predict the degree of a heterotic phenotype, either in lieu of or in addition to a comprehensive oligonucleotide array as described above.

In some examples, the analysis of the hybridization intensities is performed using an iterated evolutionary computational approach. In this approach, the software forms arbitrary sub-groupings of the oligonucleotide probe molecules and uses regression analysis to determine the predictive ability of the probe subsets. The regression may be coupled with a machine learning method and used to select the sub-groupings of oligonucleotide probe molecules that demonstrate a better performance in predicting heterotic phenotypes. Types of regression analyses that may be used include, for example, principal component regression, classic least squares, inverse least squares, and partial least squares (PLS). Machine learning methods that may be used include, for example, support vector machines and neural networks. Regression and machine learning may be used individually or in combination to perform the analysis. Selection of hybridization intensity predictors by regression analysis alone can be done, as shown in some examples, using variable importance in projection (VIP) within the PLS representation space. The process of forming subgroups and selecting better predictors through the use of regression and machine learning may also be repeated until a user-defined point. In some examples, the process is iterated until there are only slight increases in the predictive ability of the subsets. In other examples, the process is iterated until there is no increase in the predictive ability of the subsets.
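The iterate-until-little-improvement idea can be sketched as a small mutation-only evolutionary search. This is a deliberately simplified stand-in, not the genetic algorithm used in the examples. The function name and parameters are hypothetical, `score` is assumed to return a prediction error (lower is better), and the toy score used here is a placeholder for a regression-based error.

```python
import random


def evolve_subsets(probes, score, subset_size, pop_size=20,
                   tol=1e-3, max_generations=50, seed=0):
    """Search probe subsets, stopping when the best error improves by <= tol."""
    rng = random.Random(seed)
    population = [rng.sample(probes, subset_size) for _ in range(pop_size)]
    best = min(population, key=score)
    best_score = score(best)
    for _ in range(max_generations):
        # Keep the better half and refill with one-probe mutations of it.
        population.sort(key=score)
        survivors = population[: pop_size // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            unused = [p for p in probes if p not in child]
            child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
        candidate = min(population, key=score)
        gain = best_score - score(candidate)
        if gain > 0:
            best, best_score = candidate, score(candidate)
        if gain <= tol:
            break  # only a slight (or no) increase in predictive ability
    return best, best_score


# Toy setup: probe IDs 0..29, with the sum of IDs standing in for an error.
probes = list(range(30))
best, best_score = evolve_subsets(probes, score=sum, subset_size=3)
print(best, best_score)
```

Setting `tol` to zero corresponds to iterating until there is no increase at all; a positive `tol` stops once gains become slight.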

Optionally, an oligonucleotide array comprising the identified oligonucleotide probe molecules is created. In some examples the created oligonucleotide array is part of a kit for the prediction of the degree of one or more heterotic phenotypes in a plant that is available for commercial sale or internal use.

The following examples further illustrate the current invention and are not intended to limit the claims in any way. The present invention can be practiced using many different variations and has been shown by means of illustrative examples. The invention is not limited to the embodiments disclosed but also includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as set forth in the claims.

Example 1 Comparative Genome Hybridization (CGH) in Maize

Genomic DNA:

Genomic DNA was obtained from the following maize inbreds: PHP38, PHK29, PHW61, PHR03, PHW52, PHN46, PHHB4, PHBE2, PHB37, PH1FA, PHT11 and PHB47. Total cellular DNA was isolated from fresh-frozen leaf samples by using DNeasy Plant Mini Kits (Qiagen), including an incubation with RNAseA, following the instructions of the manufacturer. Samples were quantitated with a spectrophotometer and run on an agarose gel to check for integrity.

aCGH:

For each CGH hybridization, 2 μg of genomic DNA was digested with AluI and RsaI restriction enzymes (Promega). After a two-hour incubation, the samples were heated to 65° C. for 20 minutes to inactivate the enzymes. The fragmented DNA was labeled via a random primed labeling reaction (Agilent Oligonucleotide Array-Based CGH for Genomic DNA Analysis, v4.0) that incorporated Cy3-UTP into the product. The labeled DNA was filtered with a Microcon YM-30 column (Millipore) to remove unincorporated nucleotides. Samples were quantitated with a Hitachi spectrophotometer to measure yield and dye incorporation rates. Hybridization and blocking buffers (Agilent Technologies) were added to the samples, which were then denatured at 95° C. for 3 minutes and incubated at 37° C. for 30 minutes. Each sample was hybridized to an array for 40 hours at 65° C. while rotating at 10 rpm. The arrays were disassembled and washed in Oligo aCGH Wash Buffer 1 (Agilent Technologies) at room temperature for 5 minutes. A second wash was performed in Oligo aCGH Wash Buffer 2 (Agilent Technologies) for 1 minute at 37° C. Slides were then dipped in acetonitrile and air dried. An Agilent G2505B DNA microarray scanner was utilized to capture the TIF images.

Oligonucleotide Microarrays:

Custom 44K microarrays (Agilent Technologies) containing 82,272 unique 60mer oligos spanning two microarrays targeting expressed sequences of the maize genome were utilized for the hybridization of the following inbreds: PHP38, PHK29, PHW61, PHR03, PHW52, PHN46, PHHB4, PHBE2 and PHB37. Additionally, a custom 2×105K microarray (Agilent Technologies) containing 102,349 unique 60mer oligos, of which 82,272 oligos were represented on the previous 44K arrays, was utilized for hybridization of the following inbreds: PHP38, PHK29, PHW61, PHR03, PHW52, PHN46, PHHB4, PHBE2, PHB37, PH1FA, PHT11 and PHB47.

Image and Data Analysis:

The microarray images were visually inspected for image artifacts. Feature intensities were extracted, filtered, and normalized with Agilent's Feature Extraction Software (version 9.5.1). Further quality control was performed utilizing data analysis tools in Rosetta's Resolver Database.

Nebulization Vs RE Digestion

Samples were randomly sheared via nebulization. 4 to 6 μg of purified DNA samples, in a final volume of 50 μl, were mixed in the nebulizer with 700 μl of nebulization buffer (25% glycerol, 50 mM Tris-HCl, 15 mM MgCl₂). The nebulizer was chilled on ice, and connected to a compressed air source. Air was delivered at a pressure of 32 psi for 6 min. The nebulizer was spun down and the DNA solution recovered. DNA was purified on one QIAquick® PCR Purification column (Qiagen) and eluted in 30 μl of 10 mM Tris-HCl pH 8.5. 0.5 μg of randomly sheared DNA was used for the labeling and hybridization steps previously described.

After hybridization, the data from the restriction enzyme (RE) digest and randomly sheared samples were compared to determine whether the sample preparation method affects the results. The comparison of nebulized samples vs. RE digested samples showed a high correlation of fold changes (R squared=0.89). Therefore, there are no major differences in the data when either sample preparation method is utilized.
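For reference, an R squared value of this kind can be computed as the squared Pearson correlation between the fold changes obtained with the two preparation methods. The sketch below assumes that definition; the fold-change values are invented and are not the data behind the reported 0.89.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical fold changes for the same four probes under each method.
re_digest = [1.0, 2.0, 3.0, 4.0]
nebulized = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(re_digest, nebulized)
print(round(r2, 3))
```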

Example 2 Regression Analyses

The CGH array intensity ratios, values, accession numbers and oligonucleotide probe sequences were exported in ASCII text format using Rosetta Resolver 6.0 (Rosetta Biosoftware, Seattle, Wash.). The CGH intensities were imported and aligned for each inbred and array in the Matlab (ver. 7.4.0, Mathworks, Natick, Mass.) technical computing environment using both the accession numbers and oligonucleotide sequences. The genetic algorithm intensity ratio selection using partial least squares regression analysis was performed using the PLSToolbox 4.0 (Eigenvector Research, Wenatchee, Wash.) in the Matlab workspace. All calculations were run on a Dell Latitude D620 with 1.8 GHz Intel duo core processor using multi-threaded mode.

Intensity ratio values from the two 44,000 oligonucleotide probe arrays described above were assembled for the inbreds PHB73, PHW61, PHR03, PHK29, PHW52 and PHN46. For the exemplary method described here, p-values less than 0.01 were used to reduce the number of genetic algorithm predictive candidate intensity ratios from 82435 to 2786. All intensities and intensities selected by fold change criteria have also been used as inputs for the genetic algorithm.

The genetic algorithm applied to predictive intensity ratio selection was the gaselctr.m function from the PLSToolbox. The algorithm was applied to an initial population size of 256 unique intensity ratio sets with 10% of the 2786 ratios selected in each individual. Partial least squares regression (PLS) of the yield to the selected intensity ratios was performed for yield prediction. Intensity ratio sets were ranked by their PLS yield prediction error. One hundred generations of double crossover combining using the 128 best ranking individual intensity ratio sets were performed ten times. The number of latent variables in the PLS regression was set to a maximum of three. The 201 intensity ratios selected from this genetic algorithm variable selection method predicted yield with the least root mean square error in leave-one-out cross-validation among the hundreds of thousands of intensity ratio sets tested by the genetic algorithm and a regression model built with all of the intensity ratios.
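The ranking criterion above, root mean square error in leave-one-out cross-validation, can be sketched independently of the regression method. In the fragment below a single-predictor least-squares line stands in for the PLS model purely to keep the sketch self-contained; the helper names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares line, used here only as a stand-in for a PLS model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


def loo_rmse(xs, ys, fit=fit_line, predict=predict_line):
    """Hold out each sample once, refit on the rest, and pool squared errors."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        squared_errors.append((predict(model, xs[i]) - ys[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictor values and yields for five inbreds.
ratios = [1.0, 2.0, 3.0, 4.0, 5.0]
yields = [11.0, 19.5, 31.0, 39.0, 51.5]
print(loo_rmse(ratios, yields))
```

Because each held-out sample plays no part in fitting the model that predicts it, the pooled error is a more honest estimate of predictive ability than the calibration error, which matters when there are far more candidate ratios than inbreds.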

A PLS regression model was built using the intensity ratios selected from the genetic algorithm and three latent variables. This PLS regression model was used to predict yield for three additional inbreds, PHBE2, PHHB4 and PHB37, hybridized on two 44,000 oligonucleotide probe arrays. These yield predictions were a validation of the model and intensity ratio selection method. The predictions are shown in FIG. 1. The prediction comparisons indicated with a triangle are for the inbreds that are not a part of the regression model. The asterisks indicate the prediction of the calibration samples.

A PLS regression model was then built using the genetic algorithm selected intensity ratios for all nine of the inbreds, PHN46, PHR03, PHB73, PHW52, PHK29, PHW61, PHBE2, PHHB4 and PHB37, and ratios of six of the inbreds compared to a replicate measure of PHP38, PHN46, PHR03, PHB73, PHW52, PHK29 and PHW61. The replicates contributed intensity ratio noise to the model building. The number of latent variables was increased to five and autoscaling was applied to the intensity ratios to account for this noise. Mean centering was performed on the yield data. The predictions are shown in FIG. 2 for the intensity ratios derived from the arrays with 20,000 additional oligonucleotides from coding regions of the genome. The new arrays were hybridized for the nine inbreds already mentioned and three new inbreds, PH1FA, PHT11 and PHB47. The comparisons of predicted yield and measured yield for the new inbreds are indicated by the asterisks. The PLS regression model calibration samples are indicated by the triangles. The root mean square error of prediction for the new inbreds was 9 bu/ac.
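For readers unfamiliar with the two preprocessing terms, the sketch below shows what they conventionally mean: autoscaling subtracts the mean of a variable and divides by its standard deviation, and mean centering subtracts the mean only. The use of the sample (n-1) standard deviation is an assumption, and the values are invented.

```python
import statistics


def mean_center(values):
    """Subtract the mean so the values are centered on zero."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def autoscale(values):
    """Mean-center, then scale to unit standard deviation (sample, n-1).

    Assumes the values are not all identical; a constant column has zero
    standard deviation and would need to be dropped beforehand.
    """
    sd = statistics.stdev(values)
    return [v / sd for v in mean_center(values)]


# One hypothetical intensity-ratio column and one yield column.
scaled_ratios = autoscale([1.0, 2.0, 3.0])
centered_yield = mean_center([150.0, 160.0, 170.0])
print(scaled_ratios, centered_yield)
```

Autoscaling gives every intensity ratio equal weight regardless of its dynamic range, which is one common way to limit the influence of a few noisy, high-variance predictors.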

The predicted heterosis values will be an approximation of change in yield (bu/ac). This method can be used as a preliminary screening of germplasm, particularly new germplasm, and may be used to select a smaller set for experimental measurement of heterosis. In this application, the method provides a reduction of the number of lines to be evaluated in the field.

This method was validated using a larger set of samples, and with more diverse genotypes. CGH data was generated essentially as described in Example 1, by hybridization against maize CGH arrays in 2×105K format. Samples for CGH were taken from 14 R2 experiments containing plants from 3 relative maturity groups, representing 181 genotypes (91 stiff stalk, 90 non-stiff stalk inbreds) which produced 914 hybrids. The data was analyzed to identify oligonucleotides associated with heterosis using phenotypic data including yield, ear height, moisture, test weight, stay green, plant height, stalk lodging and root lodging. Data from this analysis was cross-validated with mapping data when available. For stiff-stalk inbred A vs. 36 non-stiff stalk inbreds, putative predictive oligo sets for yield, ear height, moisture and plant height were identified using the variable importance projection method described in Example 5 (shown in FIGS. 6-9).

Example 3 Comparison of Genomic DNA Preparation Methods

Genomic DNA:

Genomic DNA was obtained from the following maize inbreds: PHP38, PHK29, PHW61, PHR03, PHW52, PHN46, PHHB4, PHBE2, PHB37, PH1FA, PHT11 and PHB47. Total cellular DNA was isolated from fresh-frozen leaf samples by using DNeasy Plant Mini Kits (Qiagen), including an incubation with RNAseA, following the instructions of the manufacturer. Samples were quantitated with a spectrophotometer and run on an agarose gel to check for integrity.

aCGH:

For each CGH hybridization, 2 μg of genomic DNA was digested with AluI and RsaI restriction enzymes (Promega). After a two-hour incubation, the samples were heated to 65° C. for 20 minutes to inactivate the enzymes. The fragmented DNA was labeled via a random primed labeling reaction (Agilent Oligonucleotide Array-Based CGH for Genomic DNA Analysis, v4.0) that incorporated Cy3-UTP into the product. The labeled DNA was filtered with a Microcon YM-30 column (Millipore) to remove unincorporated nucleotides. Samples were quantitated with a Hitachi spectrophotometer to measure yield and dye incorporation rates. Hybridization and blocking buffers (Agilent Technologies) were added to the samples, which were then denatured at 95° C. for 3 minutes and incubated at 37° C. for 30 minutes. Each sample was hybridized to an array for 40 hours at 65° C. while rotating at 10 rpm. The arrays were disassembled and washed in Oligo aCGH Wash Buffer 1 (Agilent Technologies) at room temperature for 5 minutes. A second wash was performed in Oligo aCGH Wash Buffer 2 (Agilent Technologies) for 1 minute at 37° C. Slides were then dipped in acetonitrile and air dried. An Agilent G2505B DNA microarray scanner was utilized to capture the TIF images.

Oligonucleotide Microarrays:

Custom 44K microarrays (Agilent Technologies) containing 82,272 unique 60mer oligos spanning two microarrays targeting expressed sequences of the maize genome were utilized for the hybridization of the following inbreds: PHP38, PHK29, PHW61, PHR03, PHW52, PHN46, PHHB4, PHBE2 and PHB37. Additionally, a custom 2×105K microarray (Agilent Technologies) containing 102,349 unique 60mer oligos, of which 82,272 oligos were represented on the previous 44K arrays, was utilized for hybridization of the following inbreds: PHP38, PHK29, PHW61, PHR03, PHW52, PHN46, PHHB4, PHBE2, PHB37, PH1FA, PHT11 and PHB47.

Image and Data Analysis:

The microarray images were visually inspected for image artifacts. Feature intensities were extracted, filtered, and normalized with Agilent's Feature Extraction Software (version 9.5.1). Further quality control was performed utilizing data analysis tools in Rosetta's Resolver Database.

Nebulization vs RE digestion

Samples were randomly sheared via nebulization. 4 to 6 μg of purified DNA samples, in a final volume of 50 μl, were mixed in the nebulizer with 700 μl of nebulization buffer (25% glycerol, 50 mM Tris-HCl, 15 mM MgCl₂). The nebulizer was chilled on ice, and connected to a compressed air source. Air was delivered at a pressure of 32 psi for 6 min. The nebulizer was spun down and the DNA solution recovered. DNA was purified on one QIAquick® PCR Purification column (Qiagen) and eluted in 30 μl of 10 mM Tris-HCl pH 8.5. 0.5 μg of randomly sheared DNA was used for the labeling and hybridization steps previously described.

After hybridization, the data from the restriction enzyme (RE) digest and randomly sheared samples were compared to determine whether the sample preparation method affects the results. The comparison of nebulized samples vs. RE digested samples showed a high correlation of fold changes (R squared=0.89). Therefore, there are no major differences in the data when either sample preparation method is utilized.

Example 4 Genetic Diversity

The methodology outlined in Examples 1-3 was used to generate estimates of copy number variation genetic diversity in select maize genotypes. As shown in the art, research in humans has demonstrated copy number variation between monozygotic twins (Bruder, et al., (2008) Am J Hum Genet 82:763-771).

A. Plant Variation

DNA from ten maize plants of the same genotype was subjected to comparative genome hybridization and analysis essentially as described in Examples 1-3 to identify putative CNVs between the individual plants. The observed variation between plants ranges from 0.09% to 0.38%. Technical variation was also determined, and estimated to be 0.08%. Representative data for two putative CNVs showing Log Intensity vs. plant number is shown in FIG. 3.

B. Variation within a Heterotic Group

In order to estimate the diversity within a maize heterotic group, DNA isolated from two inbreds from the stiff stalk heterotic group was analyzed as described in Examples 1-3 to identify copy number variations. The observed variation was plotted as a log ratio of the two genotypes for each individual chromosome as shown in FIG. 4.

C. Variation Between Heterotic Groups

In order to estimate the diversity between two maize heterotic groups, DNA was isolated from two inbreds, a stiff stalk inbred and a non-stiff stalk inbred. The DNA was analyzed as described in Examples 1-3 to identify copy number variations. The observed variation was plotted as a log ratio of the two genotypes for each individual chromosome as shown in FIG. 5.

Example 5 Chemometrics

Chemometrics have been applied to the hybridization data to identify the oligos likely to be predictive of at least one heterotic phenotype. The analyses described in Example 2 are also chemometric methods that can be applied to genomic structural variation data.

In general, the objective of the chemometric analyses was to predict plant performance based on CGH intensity data. The analyses were optimized through selection of variables, including preprocessing and prediction based algorithms. Analysis was validated using one or more tests including a ‘leave one out’ calibration test, prediction for a new sample in the heterotic group, and/or comparison of selected oligonucleotides to known markers or mapping data. Preprocessing includes steps such as classification of data based on hybridization intensity: no variation in reference CGH intensity; less than a 10-fold change in intensity; and more than a 2-fold change in intensity. Prediction based variable selection includes use of a genetic algorithm (GA), which is a slower but more thorough method, or use of variable importance projection (VIP), which is a rapid early assessment using predictive ranking.

CGH data was generated essentially as described in Example 1, by hybridization against maize CGH arrays in 2×105K format. Samples for CGH were taken from 14 R2 experiments containing plants from 3 relative maturity groups, representing 181 genotypes (91 stiff stalk, 90 non-stiff stalk inbreds) which produced 914 hybrids. The data was analyzed to identify oligonucleotides associated with heterosis using phenotypic data including yield, ear height, moisture, test weight, stay green, plant height, stalk lodging, and root lodging. Data from this analysis was cross-validated with mapping data when available. For stiff-stalk inbred A vs. 36 non-stiff stalk inbreds, putative predictive oligo sets for yield, ear height, moisture, and plant height were identified.

In this experiment, changes in the approach were made to include an additional, more rapid method of variable selection. CGH intensities were included in the multivariate regression if there was no variation in the reference hybridization data set and if the relative intensity for each of the oligos for each inbred was less than ten for all the oligos but greater than two for at least one quarter of the inbreds. For the test set “Inbred A”, 34541 out of the 103250 available oligos met these preprocessing selection criteria. A PLS regression model was built for each of the phenotypic traits yield, ear height, plant height and moisture using one latent variable. The variable importance in the projection (VIP score) was then calculated and used to select oligos for an additional model. The VIP threshold for inclusion in the model was set higher than 1 and as high as 10. A second model was then built with the reduced number of variables and a second VIP selection was performed with these variables using similar criteria as the first. After the second variable selection iteration, leave-one-out cross-validation was performed to estimate the prediction error for each inbred. The predicted traits are compared to the measured traits in FIGS. 6-9. The chemometric analysis data for Inbred A vs. 36 non-stiff stalk inbreds are summarized in Table 1 below. Within these predictive oligo sets, there were 2 oligos found in common in the yield and plant height prediction sets, indicating that some traits may be correlated. A regression of plant height vs. yield data gave an R² value of 0.310.
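The two selection steps described above can be sketched as follows, under stated assumptions. First, the preprocessing filter is read as: for a given oligo, every inbred's relative intensity is below ten and at least one quarter of the inbreds exceed two, with intensities expressed as fold values. Second, for a PLS model with a single latent variable and a single response, the VIP score of variable j reduces to sqrt(p)·|w_j|/‖w‖, where w is proportional to Xᵀy on centered data and p is the number of variables. All names and data are hypothetical, and this is not the PLSToolbox implementation.

```python
import math


def passes_preprocessing(ratios, low=2.0, high=10.0, min_fraction=0.25):
    """Keep an oligo if all ratios are < high and enough of them are > low."""
    if any(r >= high for r in ratios):
        return False
    n_above = sum(1 for r in ratios if r > low)
    return n_above >= min_fraction * len(ratios)


def vip_one_component(X, y):
    """VIP scores for a one-latent-variable PLS model with one response.

    X is a list of rows (samples) by columns (oligos); y is the trait.
    With one component the weight vector is proportional to X'y after
    centering, and VIP_j = sqrt(p) * |w_j| / ||w||.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
         for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    return [math.sqrt(p) * abs(v) / norm for v in w]


# Hypothetical data: three inbreds, two oligos; the second oligo never varies.
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y = [1.0, 2.0, 3.0]
vip = vip_one_component(X, y)
selected = [j for j, score in enumerate(vip) if score > 1.0]
print(vip, selected)
```

By construction the squared VIP scores average to one, which is why a threshold of about 1 is a common starting point; raising it toward 10 retains only the most influential oligos.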

TABLE 1

Trait          Prediction # Oligos   R²       Validation
Yield          18                    0.7811   4 oligos mapped to region associated with yield
Ear Height     8                     0.5838   5 oligos mapped to region associated with ear height
Moisture       18                    0.6991   8 oligos mapped to region associated with moisture
Plant Height   32                    0.6362   11 oligos mapped to region associated with plant height

Example 6 Whole Genome Sequencing

Other methods that may be used for the detection of genomic structural variants, such as copy number variations, insertions, deletions and single nucleotide polymorphisms (SNPs), include methods for direct comparative DNA sequencing of genomes. Direct comparative sequencing can be accomplished in a number of ways known to those skilled in the art, including but not limited to the approaches below.

For example, whole genome shotgun sequencing and assembly using fluorescent dideoxynucleotide sequencing can be used to detect and characterize structural differences. The genomes of the individual plant lines differing in their genotypes, as determined by genetic marker analysis or pedigree analysis, are sequenced and then compared to each other using available bioinformatic software tools. Any differences are catalogued by type and genomic locations, and their numbers in each category are reported for analysis, for example as described in Examples 2 and/or 5.

Whole genome shotgun sequencing using ultra-high throughput technologies, such as the system provided by Illumina, Inc. (www.illumina.com), can be used to produce a plurality of sequences from the genomes of individual plant lines. Sequencing reads produced by this approach are assembled, and analyzed as indicated above. Optionally or in addition, the catalog of the sequence fragments obtained, or of sub-sequences within them (k-mers), is prepared and the two catalogs from two different individuals can be compared. The differences in the number of fragments in each category are noted, and statistical analysis is performed to estimate confidence intervals for these abundance differences. The catalog of the differences meeting statistical confidence criteria is submitted to the analysis as described in Examples 2 and 5, or equivalent methods in the art.
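A k-mer catalog comparison of the kind described above might be sketched as follows. The reads are toy strings and the helper names are hypothetical. The sketch reports raw count differences only; the normalization for sequencing depth and the confidence-interval estimation mentioned above are omitted.

```python
from collections import Counter


def kmer_catalog(reads, k):
    """Count every length-k sub-sequence across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def catalog_differences(catalog_a, catalog_b):
    """Return {k-mer: count_a - count_b} for k-mers whose counts differ."""
    kmers = set(catalog_a) | set(catalog_b)
    return {kmer: catalog_a[kmer] - catalog_b[kmer]
            for kmer in kmers if catalog_a[kmer] != catalog_b[kmer]}


# Toy reads from two hypothetical individuals.
catalog_a = kmer_catalog(["ACGTAC"], k=3)
catalog_b = kmer_catalog(["ACGTT"], k=3)
differences = catalog_differences(catalog_a, catalog_b)
print(sorted(differences.items()))
```

A `Counter` returns zero for a k-mer it has never seen, so k-mers present in only one individual are handled without special cases.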

Alternatively, subsets of each genome may be sequenced. For example, a subset can be individual chromosomes obtained by chromosome sorting, genome segments selected by hybridization and subsequent elution from microarrays, or a subset generated by any other method known to those skilled in the art. The catalog of the differences for the subsets of each genome meeting statistical confidence criteria is submitted to the analysis as described in Examples 2 and/or 5, or other equivalent methods. In some examples alternative methods of complete or partial genome sequencing may also be used, providing the methods can produce a catalog of differences in sequences of the genomes being compared.

In one example, direct whole genome sequencing involves the following steps:

1) isolate genomic DNA;
2) prepare genomic DNA for sequencing, optionally tag the sequence(s);
3) sequence genomic DNA from step 1 (sequencing method may tag polynucleotides);
4) map sequences to the genome and count occurrence of tags;
5) after normalization of the data, compare the tags between samples to determine CNV;
6) apply data analysis methods (e.g., Example 2 and/or Example 5) to relate the CNVs observed to at least one heterotic phenotype.
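Steps 4 and 5 above can be illustrated with a short sketch that normalizes mapped tag counts to counts per million and compares two samples as log2 ratios. The normalization scheme, the pseudocount, and the region names are assumptions made for illustration; the steps above do not prescribe a particular normalization.

```python
import math


def counts_per_million(tag_counts):
    """Scale raw tag counts so each sample sums to one million."""
    total = sum(tag_counts.values())
    return {region: count * 1e6 / total for region, count in tag_counts.items()}


def log2_ratios(counts_a, counts_b, pseudocount=1.0):
    """Per-region log2 ratio of normalized tag counts (sample a over b).

    The pseudocount avoids division by zero for regions with no tags.
    """
    norm_a = counts_per_million(counts_a)
    norm_b = counts_per_million(counts_b)
    regions = sorted(set(norm_a) | set(norm_b))
    return {region: math.log2((norm_a.get(region, 0.0) + pseudocount)
                              / (norm_b.get(region, 0.0) + pseudocount))
            for region in regions}


# Hypothetical tag counts per mapped region for two inbreds.
inbred_a = {"region_1": 300, "region_2": 100}
inbred_b = {"region_1": 100, "region_2": 100}
ratios = log2_ratios(inbred_a, inbred_b)
print(ratios)
```

Regions whose log2 ratio departs from zero beyond a chosen threshold would be candidate CNVs to pass on to the analysis in step 6. Note that normalizing to the sample total makes the ratios relative, so a gain in one region lowers the apparent ratio elsewhere.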

Optionally, the isolated genomic DNA from step 1 or step 2 could be processed to remove repetitive sequences or otherwise reduce the complexity of the sample before sequencing. For example, oligos to the repetitive regions could be synthesized and tagged with a biotin molecule. The biotinylated oligos are added to the DNA, and the sample applied to a streptavidin column. The flow-through sample of non-repetitive DNA is collected for further analysis. In another example, a microarray that targets the repetitive regions is created. The DNA sample is hybridized to the array such that the unbound fragments are collected and used for sequencing. In another method, the genomic DNA could be digested using a restriction enzyme, and then sequencing initiated from the RE site.

All publications and patent applications mentioned in the specification are indicative of the level of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference. Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be obvious that certain changes and modifications may be practiced within the scope of the appended claims. As used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, reference to “a plant” includes a plurality of such plants; reference to “a cell” includes one or more cells and equivalents thereof known to those skilled in the art, and so forth.

1-5. (canceled)
6. A method comprising: a) contacting genomic DNA from a first inbred plant with a first plurality of oligonucleotide probe molecules; b) detecting the hybridization intensities for at least a subset of oligonucleotide probe molecules in the first plurality of oligonucleotide probe molecules; c) contacting genomic DNA from a second inbred plant with a second plurality of oligonucleotide probe molecules, wherein said first and second plurality of oligonucleotide probe molecules have at least one subset of oligonucleotide probe molecules in common; d) detecting the hybridization intensities for at least a subset of oligonucleotide probe molecules in the second plurality of oligonucleotide probe molecules; e) determining relative measures of hybridization intensity for a plurality of the individual oligonucleotide probe molecules in said common subset of oligonucleotide probe molecules; and f) using said relative hybridization intensities to predict the degree of a heterosis-related phenotype for a hybrid progeny plant derived from said first and second inbred plant.
7. The method of claim 6, wherein at least one of said first and second plants comprises prepared genomic DNA.
8. The method of claim 6, wherein the plant is maize.
9. The method of claim 6, wherein the genomic DNA from the first and second inbred plants comprises prepared genomic DNA.
10. The method of claim 6, wherein said first or said second plurality of oligonucleotide probe molecules comprise at least 50% oligonucleotide probe molecules which hybridize to coding regions or other non-repetitive genomic DNA sequences.
11. The method of claim 6, wherein said subset of oligonucleotide probe molecules contains at least 100 oligonucleotide probe molecules.
12. The method of claim 6, wherein said subset of oligonucleotide probe molecules contains no more than 150 oligonucleotide probe molecules.
13. The method of claim 6, wherein the heterosis-related phenotype is plant yield.
14. The method of claim 6, wherein the plurality of oligonucleotide probe molecules comprise an oligonucleotide array.
15. The method of claim 6, wherein said oligonucleotide probe molecules are at least 20 but are not more than 100 nucleotides in length.
16. The method of claim 6, wherein said relative hybridization intensity comprises a ratio of hybridization intensity.
17. The method of claim 16, wherein oligonucleotide probe molecules exhibiting at least a three fold difference, but less than a ten fold difference in hybridization intensity are included in said comparison.
18. The method of claim 6, further comprising selecting said first and second inbred plants for development of a F1 hybrid plant variety based at least in part on said prediction of a heterosis-related phenotype.
19. The method of claim 6, wherein the relative hybridization intensities comprise a measurement of copy number variations between said first plant and said second inbred plant.
20-32. (canceled)